For this project, we will be working with a mental health dataset obtained from a previously completed ADHD research project. The dataset contains information from 175 individuals who took part in the study and includes 54 columns broken into the following categories:
| Excel Column | Variable | Key |
|---|---|---|
| C | Sex | Male-1, Female-2 |
| D | Race | White-1, African American-2, Hispanic-3, Asian-4, Native American-5, Other or missing data -6 |
| E-W | ADHD self-report scale | Never-0, rarely-1, sometimes-2, often-3, very often-4 |
| X-AM | Mood disorder questions | No-0, yes-1; question 3: no problem-0, minor-1, moderate-2, serious-3 |
| AN-AS | Individual substances misuse | no use-0, use-1, abuse-2, dependence-3 |
| AT | Court Order | No-0, Yes-1 |
| AU | Education | 1-12 grade, 13+ college |
| AV | History of Violence | No-0, Yes-1 |
| AW | Disorderly Conduct | No-0, Yes-1 |
| AX | Suicide attempt | No-0, Yes-1 |
| AY | Abuse History | No-0, Physical (P)-1, Sexual (S)-2, Emotional (E)-3, P&S-4, P&E-5, S&E-6, P&S&E-7 |
| AZ | Non-substance-related Dx | 0 – none; 1 – one; 2 – More than one |
| BA | Substance-related Dx | 0 – none; 1 – one Substance-related; 2 – two; 3 – three or more |
| BB | Psychiatric Meds | 0 – none; 1 – one psychotropic med; 2 – more than one psychotropic med |
Our task is first to explore the data and then, using the understanding gained, to build a clustering model that identifies different segments of patients. Additionally, we’ll perform Principal Component Analysis on a subset of the dataset to see what information can be gleaned. Finally, we’ll use Gradient Boosting and Support Vector Machines to predict whether a patient has attempted suicide.
As can be seen from the table above, the dataset has a wide variety of numeric and categorical data. We’ll dive deeper into this data now.
Let’s begin exploring by taking a high level look at our dataset:
glimpse(adhd)
## Rows: 175
## Columns: 54
## $ Initial <chr> "JA", "LA", "MD", "RD", "RB", "SB", "PK", "R…
## $ Age <dbl> 24, 48, 51, 43, 34, 39, 41, 48, 44, 27, 44, …
## $ Sex <dbl> 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 2,…
## $ Race <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ `ADHD Q1` <dbl> 1, 3, 2, 3, 4, 2, 2, 2, 3, 2, 2, 2, 1, 2, 1,…
## $ `ADHD Q2` <dbl> 1, 3, 1, 3, 4, 3, 2, 4, 3, 3, 3, 2, 3, 3, 0,…
## $ `ADHD Q3` <dbl> 4, 4, 2, 2, 2, 1, 1, 3, 3, 4, 2, 2, 0, 2, 1,…
## $ `ADHD Q4` <dbl> 2, 4, 1, 2, 4, 4, 3, 4, 4, 4, 3, 3, 2, 4, 2,…
## $ `ADHD Q5` <dbl> 3, 5, 3, 4, 4, 3, 4, 3, 4, 4, 3, 3, 3, 2, 2,…
## $ `ADHD Q6` <dbl> 1, 2, 3, 3, 2, 2, 4, 3, 3, 3, 3, 2, 3, 0, 3,…
## $ `ADHD Q7` <dbl> 1, 2, 3, 2, 3, 3, 2, 1, 3, 4, 3, 1, 2, 4, 0,…
## $ `ADHD Q8` <dbl> 3, 3, 2, 4, 4, 4, 3, 1, 4, 3, 3, 2, 2, 4, 0,…
## $ `ADHD Q9` <dbl> 2, 2, 0, 4, 4, 4, 3, 4, 3, 2, 2, 3, 1, 2, 2,…
## $ `ADHD Q10` <dbl> 4, 4, 1, 2, 2, 2, 4, 2, 4, 4, 3, 2, 1, 4, 2,…
## $ `ADHD Q11` <dbl> 2, 1, 2, 3, 4, 4, 3, 4, 3, 3, 2, 2, 3, 3, 1,…
## $ `ADHD Q12` <dbl> 4, 4, 0, 1, 1, 2, 1, 1, 0, 1, 1, 1, 0, 1, 0,…
## $ `ADHD Q13` <dbl> 1, 2, 2, 3, 3, 4, 4, 4, 4, 3, 2, 3, 3, 1, 2,…
## $ `ADHD Q14` <dbl> 0, 4, 2, 3, 2, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3,…
## $ `ADHD Q15` <dbl> 3, 4, 3, 1, 1, 3, 4, 0, 3, 4, 1, 3, 2, 0, 1,…
## $ `ADHD Q16` <dbl> 1, 3, 2, 2, 2, 4, 4, 0, 2, 4, 2, 1, 1, 0, 1,…
## $ `ADHD Q17` <dbl> 3, 1, 1, 1, 1, 3, 2, 1, 4, 2, 1, 1, 0, 0, 3,…
## $ `ADHD Q18` <dbl> 4, 4, 1, 2, 1, 3, 4, 1, 3, 2, 2, 2, 1, 2, 1,…
## $ `ADHD Total` <dbl> 40, 55, 31, 45, 48, 55, 54, 41, 56, 56, 42, …
## $ `MD Q1a` <dbl> 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,…
## $ `MD Q1b` <dbl> 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1,…
## $ `MD Q1c` <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,…
## $ `MD Q1d` <dbl> 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0,…
## $ `MD Q1e` <dbl> 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0,…
## $ `MD Q1f` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1,…
## $ `MD Q1g` <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1,…
## $ `MD Q1h` <dbl> 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1,…
## $ `MD Q1i` <dbl> 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1,…
## $ `MD Q1j` <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1,…
## $ `MD Q1k` <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1,…
## $ `MD Q1L` <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1,…
## $ `MD Q1m` <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0,…
## $ `MD Q2` <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,…
## $ `MD Q3` <dbl> 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 3, 0, 2, 2, 2,…
## $ `MD TOTAL` <dbl> 15, 14, 5, 13, 7, 14, 9, 7, 12, 11, 16, 0, 1…
## $ Alcohol <dbl> 1, 0, 0, 1, 1, 1, 3, 0, 1, 0, 3, 0, 1, 0, 3,…
## $ THC <dbl> 1, 0, 0, 1, 1, 0, 3, 0, 0, 3, 1, 0, 1, 0, 1,…
## $ Cocaine <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ Stimulants <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ `Sedative-hypnotics` <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ Opioids <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0, 0,…
## $ `Court order` <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ Education <dbl> 11, 14, 12, 12, 9, 11, 12, 16, 12, 9, 12, 18…
## $ `Hx of Violence` <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1,…
## $ `Disorderly Conduct` <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,…
## $ Suicide <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ Abuse <dbl> 0, 4, 6, 7, 0, 2, 4, 0, 0, 2, 4, 2, 0, 0, 0,…
## $ `Non-subst Dx` <dbl> 2, 1, 2, 2, 2, 0, 1, 2, 1, 1, 1, 2, 0, 1, 0,…
## $ `Subst Dx` <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 2, 0, 0, 0, 1,…
## $ `Psych meds.` <dbl> 2, 1, 1, 2, 0, 0, 1, 2, 1, 0, 0, 1, 0, 0, 0,…
We can see that there is a mix of both numeric and categorical data, and we note that most of the numeric columns actually encode categorical variables. Before going too far, let’s change the data types of these variables so we can explore our data properly:
adhd <- adhd %>% mutate_if(is.double, as.factor)
# convert back through character to recover the recorded ages
# (as.integer() on a factor returns the level codes, not the values)
adhd$Age <- as.integer(as.character(adhd$Age))
inspectdf::inspect_types(adhd) %>%
show_plot()
Having made the necessary changes, we can see in the visual above that our dataset consists mostly of categorical data. However, one column is of type character and another is of type integer.
table <- describe(adhd$Age)[,c(2,8,3,5,9,4)]
rownames(table) <- c('Age')
knitr::kable(table) %>%
  kable_styling()
| | n | min | mean | median | max | sd |
|---|---|---|---|---|---|---|
| Age | 175 | 1 | 22.36571 | 25 | 42 | 10.94936 |
Looking at the summary above, we can see that the ages of the subjects in this research study range from 1 to 42, with a median of 25. Let’s now take a look at the shape of the distribution and check for significant outliers:
inspectdf::inspect_num(adhd) %>%
show_plot()
In looking at the distribution of age, it appears that our dataset is multi-modal. This often means that there are sub-populations within the data. We’ll keep this in mind as we continue our exploration. We do not see any significant outliers.
Later, we will be building models to attempt to predict individuals who have committed suicide. As such, we’ll look at the distribution of Age broken down by whether an individual has attempted suicide or not:
ggplot(adhd %>% filter(Suicide %in% c(1, 0))) +
  aes(y = Age, fill = Suicide) +
  geom_histogram() +
  coord_flip()
Based on the histogram above, we do not note any significant differences in the distributions. However, this may be more easily determined by looking at a boxplot.
ggplot(adhd %>% filter(Suicide %in% c(1,0))) +
aes(x = Suicide, y = Age) +
geom_boxplot(color = 'steelblue',
outlier.color = 'firebrick',
outlier.alpha = 0.35) +
labs(title = 'Age by Suicide Attempt', y = 'Age', x = 'Suicide') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey",
linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
In looking at the boxplots above, it does appear that, in general, those who attempted suicide were younger than those who did not.
Now let’s turn our attention to our categorical variables. We have both binary categorical variables, and variables with 3 or more classes. Looking at this data in a table would be unwieldy due to its size, so we’ll visualize this data instead.
inspect_cat(adhd) %>%
show_plot()
In looking at the plots above, we note:

- Abuse: Nearly half of our population has been subject to some type of abuse
- ADHD Q1-18: There is a fairly equal dispersion among the responses to each of these questions
- Alcohol: Alcohol appears to be used by half of our population
- Cocaine: About a third of the population uses cocaine
- Court Order: The majority of the population does not have a court order
- Disorderly Conduct: Over half of the population has a record of disorderly conduct
- Education: A very small percentage of the population has attended college
- Hx of Violence: Subjects with a history of violence are a minority of the population
- MD Q1-Q3: There is a fairly even breakout of responses for each of these questions
- Non-subst Dx: We note quite a few missing values in this column; most individuals do not have a non-substance-related diagnosis
- Opioids: Most subjects do not use opioids
- Psych meds.: The majority of this column is missing data
- Race: The majority of the population is either white or African American
- Sedative-hypnotics: Hardly anyone in the population uses sedative-hypnotics
- Sex: The population is a fairly even split between males and females
- Stimulants: Stimulant usage among the participants is very minimal
- Subst Dx: A fairly even split between the categories; we note quite a few missing values in this column
- Suicide: It appears that around a quarter of the population has attempted suicide
- THC: THC is used by only around a third of our population

Again, we will be building models to predict who has attempted suicide, so it will also be helpful to understand the distributions of these variables as they relate to our Suicide variable, both Yes and No. As mentioned above, only a quarter of the population has attempted suicide, so we expect counts for those who have attempted suicide to be considerably lower than for those who have not; however, we perform this exercise to see if anything major sticks out as a red flag.
Based on the charts above, we do not note anything we would consider out of the ordinary where suicide attempts outnumber those who have not attempted suicide for any of the variables. In fact, the distributions look fairly similar given our understanding of the percentage of those who have attempted suicide.
As we’ve seen throughout our EDA, there are quite a few missing values within the columns. Let’s take a closer look so we can determine the best way to deal with them.
visdat::vis_miss(adhd, sort_miss = TRUE)
It appears that about 2.6% of our dataset is missing and, as previously mentioned, Psych meds. makes up a large portion of that. With so much of the column absent, there is no good way to impute its values, so we have decided to drop this variable from the dataset.
adhd <- adhd %>%
select(-`Psych meds.`)
Further, we see a pattern within the missing data: many of the respondents who are missing data on suicide attempt are also missing additional data. As our objective is to predict suicide attempts, we will drop the subjects who are missing that variable, since their records would be unusable in our modeling.
adhd <- adhd %>%
filter(!is.na(Suicide))
Having made these adjustments, let’s look at our missing data again.
visdat::vis_miss(adhd, sort_miss = TRUE)
We can see clearly that these adjustments have made a significant impact: only 0.4% of our dataset is now missing. We can use an imputation method to fill in the remaining missing values.
We have chosen to use the pmm method (predictive mean matching) from the mice library to impute our missing values. Predictive mean matching calculates the predicted value for our target variable, and, for missing values, forms a small set of “candidate donors” from the complete cases that are closest to the predicted value for our missing entry. Donors are then randomly chosen from candidates and imputed where values were once missing. To apply pmm we assume that the distribution is the same for missing cells as it is for observed data, and thus, the approach may be more limited when the % of missing values is higher.
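As a rough sketch of the donor mechanism on toy data of our own (simplified: mice's default donor pool holds 5 candidates, and it also draws the regression coefficients rather than using a single least-squares fit), pmm looks roughly like this:

```r
# Toy sketch of predictive mean matching (pmm); illustrative only
set.seed(1)
toy <- data.frame(x = 1:10, y = c(2, 4, NA, 8, 10, 12, NA, 16, 18, 20))
obs  <- !is.na(toy$y)
fit  <- lm(y ~ x, data = toy[obs, ])   # model built on the complete cases
pred <- predict(fit, newdata = toy)    # predicted y for every row
for (i in which(!obs)) {
  # the complete cases whose predictions are closest form the donor pool
  donors <- order(abs(pred[obs] - pred[i]))[1:3]
  # impute with the observed value of a randomly chosen donor
  toy$y[i] <- sample(toy$y[obs][donors], 1)
}
```

Because imputed values are always real observed values, pmm never produces implausible entries (such as fractional answers on an integer-coded questionnaire).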
In order to apply this method, we’ll need to adjust our column names that have spacing. We can use the clean_names function from the janitor library to do this for us. Once this is completed, we’ll impute our missing values:
adhd <- adhd %>%
janitor::clean_names()
adhd <- mice(data = adhd, m = 1, method = "pmm", seed = 500)
##
## iter imp variable
## 1 1 education* abuse* non_subst_dx* subst_dx*
## 2 1 education* abuse* non_subst_dx* subst_dx*
## 3 1 education* abuse* non_subst_dx* subst_dx*
## 4 1 education* abuse* non_subst_dx* subst_dx*
## 5 1 education* abuse* non_subst_dx* subst_dx*
## * Please inspect the loggedEvents
## Warning: Number of logged events: 45
adhd <- mice::complete(adhd, 1)
We can see now that we have successfully imputed all missing values:
visdat::vis_miss(adhd, sort_miss = TRUE)
K-means clustering creates clusters by finding groups of observations that are similar, as determined by Euclidean distance. The method has advantages (it is computationally efficient) and disadvantages (results change from run to run, depend on the starting values, and there is no definitive way to determine the appropriate number of clusters), but we ultimately chose k-means over hierarchical clustering because there is no real reason to believe this data has a hierarchical structure.
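To make the Euclidean-distance mechanics concrete, here is a sketch of a single Lloyd iteration on toy 2-D data (our own synthetic points; kmeans() simply repeats these two steps until the assignments stop changing):

```r
set.seed(1)
# two loose blobs of 2-D points
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 4), ncol = 2))
centers <- pts[sample(nrow(pts), 2), ]            # random starting centroids
# Euclidean distances from every point to each centroid
d <- as.matrix(dist(rbind(centers, pts)))[-(1:2), 1:2]
cluster <- max.col(-d)                            # assign to the nearest centroid
centers <- apply(pts, 2, tapply, cluster, mean)   # recompute the centroids
```

The dependence on the random starting centroids in the first line is exactly why k-means results change from run to run.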
The first step was to pull out only the columns we wanted to use. Since k-means is based on Euclidean distance, we chose to use only variables with continuous values, ruling out all binary and factor-based predictors. Some of this data came in as factors (despite being continuous), so we had to convert those to numeric values.
library(tidyverse)
adhd_km <- adhd %>%
select(age, md_total, education, adhd_total)
adhd_km <- adhd_km %>%
mutate_if(is.factor, as.character) %>%
mutate_if(is.character, as.numeric)
The next step was to transform the data. Again, the Euclidean distance measurement calls for us to center and scale the data. We also chose to perform a Box-Cox transformation to help reduce skewness.
library(caret)
adhd_km_trans <- preProcess(adhd_km,
method = c("center", "scale", "BoxCox"))
adhd_km_trans <- predict(adhd_km_trans, adhd_km)
Now that we have our data set up, we can get a sense of how many clusters to create. To do this, we used the silhouette method: when the average silhouette width starts to decrease, you stop gaining information by dividing into more clusters. This led us to an optimal number of 2 clusters.
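What the silhouette method computes can be reproduced directly with the cluster package. As a self-contained illustration we use a few scaled columns of the built-in mtcars data as a stand-in for our own scaled matrix:

```r
library(cluster)  # provides silhouette()
set.seed(42)
dat <- scale(mtcars[, c("mpg", "hp", "wt")])   # stand-in for our scaled data
d   <- dist(dat)
avg_sil <- sapply(2:10, function(k) {
  km <- kmeans(dat, centers = k, nstart = 50)
  mean(silhouette(km$cluster, d)[, "sil_width"])  # average silhouette width
})
(2:10)[which.max(avg_sil)]   # the k where the average width peaks
```

Each point's silhouette width compares its average distance to its own cluster against its average distance to the nearest other cluster, so the k with the highest average width gives the cleanest separation.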
library(factoextra)
fviz_nbclust(adhd_km_trans, kmeans, method = "silhouette", k.max = 20)
Now we finally get to create and visualize our clusters! To visualize these four-dimensional clusters, we used PCA to reduce the dimensionality of our data.
library(factoextra)
km.res <- kmeans(adhd_km_trans, centers = 2, nstart = 50)
fviz_cluster(km.res, adhd_km_trans)
What incredible clusters! We wanted to get a better sense of what made these clusters unique, so we went back to the original data and performed a bit of analysis: we computed the mean of each variable within each cluster and plotted these values to see the differences between the two clusters more clearly.
library(tidyverse)
adhd_km_res <- cbind(adhd_km, km.res$cluster)
km_comparison <- adhd_km_res %>%
group_by(km.res$cluster) %>%
rename(cluster = `km.res$cluster`) %>%
summarise_all(mean) %>%
pivot_longer(c("age":"adhd_total"),
names_to = "category",
values_to = "mean")
ggplot(km_comparison,
aes(x = as.factor(cluster),
y = mean,
fill = as.factor(category))) +
geom_bar(position = "dodge",
stat = "identity") +
scale_fill_brewer(palette = "Set3") +
labs(x = "Cluster",
y = "Value",
fill = "Factor")
By looking at these two clusters, we can see that one cluster has high total ADHD and MD scores along with lower age and education. Our two groups are therefore:

- Younger and less educated, with high ADHD and MD scores
- Older and more educated, with lower ADHD and MD scores
Out of curiosity, we wanted to add back in a single binary variable - suicide - since we knew we would be investigating it later. We knew k-means would struggle with this, but we were curious to see how much it would impact the outcome.
We went through the same steps as before, but this time included suicide as our first variable.
library(tidyverse)
library(caret)
library(factoextra)
adhd_km <- adhd %>%
select(suicide, age, md_total, education, adhd_total)
adhd_km <- adhd_km %>%
mutate_if(is.factor, as.character) %>%
mutate_if(is.character, as.numeric)
adhd_km_trans <- preProcess(adhd_km,
method = c("center", "scale", "BoxCox"))
adhd_km_trans <- predict(adhd_km_trans, adhd_km)
fviz_nbclust(adhd_km_trans, kmeans, method = "silhouette", k.max = 20)
km.res <- kmeans(adhd_km_trans, centers = 12, nstart = 50)
fviz_cluster(km.res, adhd_km_trans)
adhd_km_res <- cbind(adhd_km, km.res$cluster)
km_comparison <- adhd_km_res %>%
group_by(km.res$cluster) %>%
rename(cluster = `km.res$cluster`) %>%
summarise_all(mean) %>%
pivot_longer(c("suicide":"adhd_total"),
names_to = "category",
values_to = "mean")
ggplot(km_comparison,
aes(x = as.factor(cluster),
y = mean,
fill = as.factor(category))) +
geom_bar(position = "dodge",
stat = "identity") +
scale_fill_brewer(palette = "Set3") +
labs(x = "Cluster",
y = "Value",
fill = "Factor")
Adding in just one binary variable - suicide - had a massive impact on our clustering. The optimal number of clusters jumped sixfold, from 2 all the way up to 12. All of a sudden we have clusters that fit a bunch of different scenarios. Based on our clusters, we ended up with the following conclusions:

* We had 4 groups with non-zero means for suicide (meaning suicide was attempted)
* Of these 4 groups, 3 had higher ADHD scores, 1 was an older group and 1 was a younger group, and education was low for all 4
We’re curious if the massive increase in the optimal number of clusters was caused by the addition of a single variable or the addition of a binary variable. The best way to tell would be to replace the binary suicide variable with another continuous variable, but we already used all of the provided continuous variables. We could always remove one of our original variables and retain the binary suicide variable, but even this would cause some level of distortion.
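One quick way to see why a binary column distorts a Euclidean-distance method: after centering and scaling, every 0/1 pair of rows is separated by a single fixed gap on that axis (the 25/75 split below mirrors the roughly one-quarter attempt rate in our data; the numbers are illustrative):

```r
# a 0/1 column with a 25/75 split, centered and scaled as in our pipeline
s <- scale(c(rep(0, 75), rep(1, 25)))
diff(range(s))   # the fixed gap separating every 0 from every 1 on this axis
```

A gap of more than two scaled units on a single axis dwarfs typical differences along the continuous axes, which helps explain the jump in the optimal cluster count.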
We begin by dropping the initial column, as it should not carry any predictive information. The data is then split into training and testing sets, stratified on suicide:
df_split = initial_split(adhd %>% select(-c(initial)), strata=suicide)
df_train <- training(df_split)
df_test <- testing(df_split)
Gradient boosting machines are similar to random forests or bagging approaches, but instead of growing a large number of trees from independent random samples of the data, we grow them sequentially, each tree fit to what the current ensemble gets wrong. Boosting improves (boosts) weak learners sequentially, increasing the accuracy of the combined model.
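The sequential idea can be sketched with depth-1 trees (stumps) on synthetic data of our own. This toy version uses a squared-error loss for clarity, whereas the gbm model we fit uses the adaboost loss:

```r
library(rpart)  # recursive partitioning trees, used here for the stumps
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)
pred <- rep(mean(y), 200)   # start from the constant prediction
shrinkage <- 0.1
for (m in 1:100) {
  resid <- y - pred         # what the current ensemble gets wrong
  stump <- rpart(resid ~ x, data = data.frame(x = x, resid = resid),
                 control = rpart.control(maxdepth = 1))
  pred <- pred + shrinkage * predict(stump)   # add a shrunken correction
}
```

Each stump only nudges the prediction by a shrunken amount, which is why the number of trees and the shrinkage rate jointly control how closely (or how over-closely) the ensemble fits the training data.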
After trying most combinations of the different parameters, we called the gbm function with the following settings:

- the number of trees set to 1000; the minimum number of trees used in a model ensemble is around 100, and setting this too high would cause our model to overfit
- interaction depth (the number of splits performed on each tree) set to 4
- distribution set to "adaboost", as we are working with 0-1 outcomes
modelGBM <- gbm(as.character(suicide) ~ ., data = data.frame(df_train),
                n.trees = 1000, distribution = 'adaboost',
                interaction.depth = 4, shrinkage = 0.01)
pgbm <- predict(modelGBM, newdata = data.frame(df_test), n.trees = 1000, type = 'response')
pgbm[pgbm > 0.5] <- 1
pgbm[pgbm <= 0.5] <- 0
summary(modelGBM,
        cBars = 10,
        n.trees = modelGBM$n.trees,
        plotit = TRUE,
        order = TRUE,
        method = relative.influence,
        normalize = TRUE)
## var rel.inf
## adhd_total adhd_total 8.286190e+01
## md_total md_total 7.902049e+00
## education education 1.182468e+00
## abuse abuse 1.167775e+00
## adhd_q14 adhd_q14 7.516594e-01
## adhd_q15 adhd_q15 6.714416e-01
## adhd_q1 adhd_q1 6.095127e-01
## adhd_q11 adhd_q11 5.135936e-01
## adhd_q10 adhd_q10 4.551679e-01
## adhd_q4 adhd_q4 4.135598e-01
## race race 3.706968e-01
## alcohol alcohol 2.724549e-01
## adhd_q17 adhd_q17 2.451256e-01
## adhd_q7 adhd_q7 2.378631e-01
## adhd_q18 adhd_q18 2.122014e-01
## age age 1.905235e-01
## adhd_q5 adhd_q5 1.900748e-01
## adhd_q16 adhd_q16 1.851144e-01
## adhd_q9 adhd_q9 1.814930e-01
## subst_dx subst_dx 1.592262e-01
## sex sex 1.436468e-01
## adhd_q12 adhd_q12 1.393376e-01
## adhd_q3 adhd_q3 1.331517e-01
## adhd_q6 adhd_q6 1.040196e-01
## adhd_q8 adhd_q8 1.037221e-01
## adhd_q2 adhd_q2 1.028580e-01
## adhd_q13 adhd_q13 8.494249e-02
## non_subst_dx non_subst_dx 8.409667e-02
## md_q3 md_q3 8.317285e-02
## cocaine cocaine 8.053583e-02
## md_q1m md_q1m 7.031097e-02
## hx_of_violence hx_of_violence 4.110973e-02
## thc thc 2.073343e-02
## md_q1d md_q1d 7.304742e-03
## md_q1j md_q1j 7.274572e-03
## disorderly_conduct disorderly_conduct 6.218441e-03
## md_q1a md_q1a 3.724065e-03
## md_q1k md_q1k 2.639621e-03
## md_q1g md_q1g 2.542308e-03
## md_q1e md_q1e 2.400188e-03
## md_q1i md_q1i 1.953731e-03
## md_q1c md_q1c 1.711626e-04
## md_q1h md_q1h 1.673989e-04
## md_q1l md_q1l 6.609901e-05
## md_q2 md_q2 7.590723e-07
## md_q1b md_q1b 0.000000e+00
## md_q1f md_q1f 0.000000e+00
## stimulants stimulants 0.000000e+00
## sedative_hypnotics sedative_hypnotics 0.000000e+00
## opioids opioids 0.000000e+00
## court_order court_order 0.000000e+00
The summary function gives us a variable importance plot; adhd_total and md_total are by far the most important variables in the model.
confusionMatrix(as.factor(df_test$suicide), as.factor(pgbm))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 26 3
## 1 8 5
##
## Accuracy : 0.7381
## 95% CI : (0.5796, 0.8614)
## No Information Rate : 0.8095
## P-Value [Acc > NIR] : 0.9112
##
## Kappa : 0.3145
##
## Mcnemar's Test P-Value : 0.2278
##
## Sensitivity : 0.7647
## Specificity : 0.6250
## Pos Pred Value : 0.8966
## Neg Pred Value : 0.3846
## Prevalence : 0.8095
## Detection Rate : 0.6190
## Detection Prevalence : 0.6905
## Balanced Accuracy : 0.6949
##
## 'Positive' Class : 0
##
It appears that Boosting has a correct classification rate of 74%. We’ll use this as a benchmark as we move forward to an SVM method.
A Support Vector Machine (SVM) is an algorithm that searches the feature space for the optimal hyperplane: the one that separates the classes with the maximum margin. Here we train an SVM to find the dividing plane, based on the features we have, between those who have attempted suicide and those who have not.
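On a small synthetic, linearly separable dataset (entirely our own, for illustration), the fitted e1071 model exposes the support vectors, from which the hyperplane w·x + b = 0 and the margin being maximized can be recovered:

```r
library(e1071)
set.seed(1)
# two well-separated 2-D classes
X <- rbind(matrix(rnorm(40, mean = -2), ncol = 2),
           matrix(rnorm(40, mean =  2), ncol = 2))
y <- factor(rep(c(0, 1), each = 20))
toy_svm <- svm(X, y, kernel = "linear", cost = 10, scale = FALSE)
# recover w and b of the hyperplane w.x + b = 0 from the support vectors
w <- t(toy_svm$coefs) %*% toy_svm$SV
b <- -toy_svm$rho
2 / sqrt(sum(w^2))   # the margin width the SVM has maximized
```

Only the support vectors contribute to w, which is why the summaries below report how many support vectors each fit uses.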
We fit an SVM to the training set:
set.seed(42)
svm_model = svm(suicide~., data=df_train, kernel='linear', probability=TRUE)
summary(svm_model)
##
## Call:
## svm(formula = suicide ~ ., data = df_train, kernel = "linear",
## probability = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 81
##
## ( 50 31 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
Our base model consists of 81 support vectors, of which 50 are assigned to label 0 (no suicide) and 31 to label 1 (suicide).
We will tune the SVM on the training set, searching over cost (with gamma fixed at 0.25). The tuning is done with 10-fold cross validation.
svm_tune <- tune.svm(suicide ~ ., data = df_train, gamma = 0.25, cost = seq(0.1, 1, 0.1))
summary(svm_tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma cost
## 0.25 0.1
##
## - best performance: 0.3
##
## - Detailed performance results:
## gamma cost error dispersion
## 1 0.25 0.1 0.3 0.1191534
## 2 0.25 0.2 0.3 0.1191534
## 3 0.25 0.3 0.3 0.1191534
## 4 0.25 0.4 0.3 0.1191534
## 5 0.25 0.5 0.3 0.1191534
## 6 0.25 0.6 0.3 0.1191534
## 7 0.25 0.7 0.3 0.1191534
## 8 0.25 0.8 0.3 0.1191534
## 9 0.25 0.9 0.3 0.1191534
## 10 0.25 1.0 0.3 0.1191534
svm_best <- svm_tune$best.model
It looks like the best parameters are gamma = 0.25 and cost = 0.1.
##
## Call:
## svm(formula = suicide ~ ., data = df_train, cost = 0.1, gamma = 0.25,
## kernel = "linear", probability = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.1
##
## Number of Support Vectors: 88
##
## ( 52 36 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
Now with the best model, we can test it against the training and test sets.
y_train = predict(svm_best, df_train %>% select(-c(suicide)))
table(df_train[,'suicide'], y_train)
## y_train
## 0 1
## 0 84 0
## 1 5 31
y_test = predict(svm_best, df_test %>% select(-c(suicide)))
table(df_test[,'suicide'], y_test)
## y_test
## 0 1
## 0 24 5
## 1 6 7
confusionMatrix(df_test[,'suicide'], y_test, positive='1')
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 24 5
## 1 6 7
##
## Accuracy : 0.7381
## 95% CI : (0.5796, 0.8614)
## No Information Rate : 0.7143
## P-Value [Acc > NIR] : 0.4418
##
## Kappa : 0.374
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.5833
## Specificity : 0.8000
## Pos Pred Value : 0.5385
## Neg Pred Value : 0.8276
## Prevalence : 0.2857
## Detection Rate : 0.1667
## Detection Prevalence : 0.3095
## Balanced Accuracy : 0.6917
##
## 'Positive' Class : 1
##
As we can see, the SVM model makes only 5 mistakes on the training set (all false negatives). It makes proportionally more mistakes on the testing set, which is to be expected.
pred <- predict(svm_best, df_test, decision.values = TRUE, probability = TRUE)
pred_prob <- attr(pred, 'probabilities')
df_prob <- cbind(df_test %>% select(suicide), y_test, pred_prob)
names(df_prob) = c('y_true' ,'y_pred', 'negative', 'positive')
summary(df_prob)
## y_true y_pred negative positive
## 0:29 0:30 Min. :0.6404 Min. :0.2302
## 1:13 1:12 1st Qu.:0.6717 1st Qu.:0.2822
## Median :0.6935 Median :0.3065
## Mean :0.6954 Mean :0.3046
## 3rd Qu.:0.7178 3rd Qu.:0.3283
## Max. :0.7698 Max. :0.3596
We can see from the predicted probabilities that the SVM model is more likely to predict that an individual has not attempted suicide.
PCA is used in exploratory data analysis and for making predictive models. It is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data’s variation as possible. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data. The i-th principal component can be taken as a direction orthogonal to the first i-1 principal components that maximizes the variance of the projected data.
(Source : https://en.wikipedia.org/wiki/Principal_component_analysis)
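The variance-maximizing property can be verified on toy data of our own: with scaling, prcomp() diagonalizes the correlation matrix, so the squared standard deviations it reports are that matrix's eigenvalues:

```r
set.seed(1)
toy <- matrix(rnorm(300), ncol = 3)
toy[, 3] <- toy[, 1] + 0.5 * toy[, 2] + rnorm(100, sd = 0.3)  # a correlated column
p  <- prcomp(toy, scale. = TRUE)
ev <- eigen(cor(toy))$values
all.equal(p$sdev^2, ev)   # the PC variances are the eigenvalues
cumsum(ev) / sum(ev)      # cumulative proportion of variance explained
```

The cumulative-proportion vector is what tells us how many components we need to keep to preserve most of the data's variation.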
We’ll use the prcomp() function to reduce the number of dimensions of our data and measure the amount of variance explained. As PCA works best with numerical data, we drop the categorical initial column. For this analysis we also set aside the ADHD questionnaire columns and concentrate on the mood disorder questions, since mood swings seem to be a strong factor in suicide attempts. This leaves a matrix of 33 numeric columns, which we pass to prcomp(); the function returns its results as an object of class 'prcomp', which we assign to a variable named pca.out.
# work on a copy for the PCA analysis
adhd.pca <- data.frame(adhd)
# drop the character column (initial) and the ADHD questionnaire columns
adhd.pca <- select(adhd.pca, -initial, -starts_with("adhd_q"), -adhd_total)
# convert the remaining factor columns to numeric for the PCA
adhd.pca <- mutate_if(adhd.pca, is.factor, as.numeric)
# run the principal component analysis on centered and scaled data
pca.out <- prcomp(adhd.pca, scale = TRUE)
pca.out
## Standard deviations (1, .., p=33):
## [1] 2.649866e+00 1.667582e+00 1.517793e+00 1.410442e+00 1.317541e+00
## [6] 1.234053e+00 1.203317e+00 1.107268e+00 1.053497e+00 1.012235e+00
## [11] 9.818435e-01 9.567827e-01 9.126114e-01 8.790474e-01 8.642734e-01
## [16] 8.389314e-01 8.141526e-01 7.718777e-01 7.511017e-01 7.202164e-01
## [21] 6.961629e-01 6.542946e-01 6.371036e-01 6.321703e-01 5.901505e-01
## [26] 5.827049e-01 5.224433e-01 5.091218e-01 4.918273e-01 4.603005e-01
## [31] 4.076608e-01 3.891165e-02 4.072422e-16
##
## Rotation (n x k) = (33 x 33):
## PC1 PC2 PC3 PC4
## age 0.03469272 -0.02795421 0.10612354 -0.27323774
## sex -0.03529444 -0.22687351 -0.08987261 0.06522574
## race 0.06586145 0.28566259 0.22420588 -0.19317186
## md_q1a -0.27044583 0.02843719 -0.09549368 -0.07566968
## md_q1b -0.24862956 -0.17445100 -0.06222698 -0.07295610
## md_q1c -0.15515216 0.26266927 0.21383572 -0.02629291
## md_q1d -0.21969524 -0.01139197 0.05523678 -0.07662104
## md_q1e -0.23075163 0.04234849 0.14067828 -0.05575442
## md_q1f -0.24413027 -0.17143952 0.02798320 -0.02055926
## md_q1g -0.24696673 -0.20604513 0.02006128 -0.10716428
## md_q1h -0.22245811 0.14788474 0.18999798 0.11320563
## md_q1i -0.21574050 0.16426516 0.15323829 0.03519754
## md_q1j -0.21277108 0.12077625 0.18128584 0.04417958
## md_q1k -0.20291342 0.17075271 0.07676026 0.08296841
## md_q1l -0.26635638 0.01470877 -0.05531620 0.03779050
## md_q1m -0.21100877 0.05281260 0.01556825 0.02569672
## md_q2 -0.26244396 -0.16156199 0.13637663 0.03491599
## md_q3 -0.18692853 -0.25054736 -0.09241663 -0.07575819
## md_total -0.37077958 -0.02033346 0.08336609 -0.02175389
## alcohol -0.12004834 0.08896461 -0.09911649 0.13433274
## thc -0.02050055 0.19066834 -0.19793178 0.15824150
## cocaine -0.04517386 0.36426575 -0.06055117 -0.08625893
## stimulants -0.03777763 -0.02694936 -0.11917140 0.19607340
## sedative_hypnotics -0.01304696 -0.05938442 -0.12763572 0.35183905
## opioids -0.05247028 -0.11539783 -0.14156431 0.38222201
## court_order -0.04106587 0.02931041 -0.15043748 0.07385034
## education 0.10651109 -0.12501654 0.44105813 0.37698186
## hx_of_violence 0.10651109 -0.12501654 0.44105813 0.37698186
## disorderly_conduct -0.06230218 0.21697473 -0.26654149 0.04419898
## suicide -0.13273367 -0.15471925 -0.16679636 0.15351100
## abuse -0.04040704 -0.15737653 -0.22772880 0.01583052
## non_subst_dx 0.01155786 -0.32267256 0.05776974 -0.13821692
## subst_dx -0.06300744 0.28407666 -0.20726604 0.35400797
## PC5 PC6 PC7 PC8
## age 0.47278164 -0.270578426 0.19775076 -0.04330043
## sex 0.22789549 0.422738904 0.10345553 0.35686128
## race -0.09637874 0.111913168 0.02023633 0.28232588
## md_q1a -0.03782752 -0.126405201 0.01744824 -0.01413718
## md_q1b -0.16543807 -0.014347712 -0.12662419 0.17346536
## md_q1c 0.09490764 -0.040626756 -0.01941350 0.06331634
## md_q1d 0.05512104 -0.050769426 0.11001808 0.02195578
## md_q1e 0.14622817 0.087617400 0.13069549 0.15040371
## md_q1f -0.06466367 -0.023417435 0.04733737 0.07619921
## md_q1g -0.08725510 0.001101435 0.03903264 0.11720881
## md_q1h 0.06245039 0.271122543 -0.04237007 -0.04853672
## md_q1i 0.14250045 0.191593915 0.01207008 -0.08599415
## md_q1j 0.03831358 0.293985344 -0.04203836 -0.21677885
## md_q1k -0.07537880 0.174498227 -0.05990734 -0.30203480
## md_q1l -0.07001815 -0.273978098 -0.11545942 -0.06367887
## md_q1m 0.12188641 -0.282952915 -0.09231209 -0.03952059
## md_q2 -0.19175816 -0.117596274 -0.00373041 0.16279108
## md_q3 -0.05749273 -0.112697053 -0.01660499 0.12651368
## md_total -0.01815860 -0.016876427 -0.02703472 0.02545512
## alcohol 0.15374251 -0.243420922 0.31313585 -0.27517669
## thc -0.34744264 0.083090024 0.26117533 0.31251349
## cocaine 0.21699906 0.025545086 0.17392878 0.18916273
## stimulants -0.19337073 -0.041896184 0.15751462 0.04026151
## sedative_hypnotics 0.22457658 0.059213989 -0.37050617 -0.07201711
## opioids 0.21786394 0.029460240 -0.26879342 0.04486552
## court_order -0.23985533 0.271707256 0.35087544 -0.30921483
## education -0.04645289 -0.158187822 0.21367705 0.05999551
## hx_of_violence -0.04645289 -0.158187822 0.21367705 0.05999551
## disorderly_conduct -0.14475021 -0.235385249 0.13374782 -0.12281490
## suicide 0.08924076 0.119062079 0.15529349 -0.15570934
## abuse 0.32599073 0.060288594 0.33683599 0.16236720
## non_subst_dx 0.04779839 0.130844302 0.25223430 -0.29076332
## subst_dx 0.15380509 -0.103606673 0.07469640 0.21527276
## PC9 PC10 PC11 PC12
## age 0.133176264 -3.052525e-02 -0.015951238 -0.083591933
## sex 0.039997480 1.466959e-01 -0.023498922 -0.100533619
## race -0.386907599 -7.643193e-02 0.126265212 -0.072983447
## md_q1a 0.032414083 7.127731e-02 0.180503066 0.190896260
## md_q1b -0.105960674 -1.246399e-01 -0.134001492 -0.007910102
## md_q1c 0.094016163 1.737132e-02 -0.080067316 -0.099504904
## md_q1d 0.025009366 -2.353973e-01 0.364118655 -0.393120078
## md_q1e 0.102473263 2.123074e-01 0.168033281 0.151930884
## md_q1f 0.131852140 -1.103187e-01 -0.059881380 0.015404054
## md_q1g -0.005209731 -1.098592e-01 -0.025440045 -0.003290307
## md_q1h 0.086040431 -2.642789e-03 0.036487092 -0.015638509
## md_q1i 0.194669499 2.891357e-02 -0.210925587 -0.077573658
## md_q1j 0.046316217 1.288044e-01 -0.010286094 0.085761970
## md_q1k -0.233588519 -1.926662e-01 0.172746114 0.273791842
## md_q1l 0.024765307 -1.858152e-01 0.039486294 0.265833170
## md_q1m -0.155035075 2.676809e-01 -0.306783859 -0.017530621
## md_q2 0.030476658 -1.200344e-01 -0.046194288 -0.095152168
## md_q3 -0.150642388 3.723367e-01 -0.196165357 -0.123396779
## md_total -0.004459661 5.205068e-02 -0.041352876 0.001041501
## alcohol -0.362609156 2.157563e-01 -0.061603604 -0.102738704
## thc 0.019397114 -8.222242e-02 -0.251419520 0.180575433
## cocaine 0.114413785 -2.693442e-02 -0.106170776 0.068926781
## stimulants 0.154433646 5.056098e-01 0.558068390 0.094704649
## sedative_hypnotics -0.088145406 -4.956261e-03 -0.144510740 0.209193232
## opioids 0.276744596 -8.693465e-02 0.077018016 -0.265238048
## court_order 0.053788812 7.626777e-02 -0.296854189 -0.294052851
## education 0.013529385 -6.513594e-02 -0.048118528 0.032910360
## hx_of_violence 0.013529385 -6.513594e-02 -0.048118528 0.032910360
## disorderly_conduct 0.437089938 -2.092485e-01 -0.011507480 -0.094311994
## suicide -0.372355970 -2.656045e-01 0.185347415 -0.272121979
## abuse -0.080053240 -2.518829e-01 -0.003927053 0.351770988
## non_subst_dx 0.142774767 -1.547262e-02 -0.048952407 0.321002849
## subst_dx -0.134550716 -5.005299e-05 0.036022214 0.010185238
## PC13 PC14 PC15 PC16
## age -0.0566835833 0.163616334 0.012297629 0.24199815
## sex -0.3149072797 -0.047271547 0.225311064 -0.08363222
## race 0.1481320472 0.082170128 0.015323069 -0.08418820
## md_q1a -0.0307574353 -0.142836076 0.137302947 -0.19383557
## md_q1b -0.0075661030 -0.090756403 0.055728793 0.02457904
## md_q1c 0.4587862022 -0.420417251 -0.115378732 0.11296587
## md_q1d 0.0232887585 -0.250551962 0.179881414 -0.08397800
## md_q1e -0.1700571432 -0.280241398 0.077864771 0.26432469
## md_q1f -0.2377936951 0.056933826 -0.068781708 0.24276227
## md_q1g 0.1113020936 0.347037567 -0.011928770 -0.06340376
## md_q1h -0.2021262501 0.055227529 -0.411538837 -0.01267809
## md_q1i 0.1391623356 0.152084344 -0.400990623 -0.02541391
## md_q1j 0.0006926974 0.079675318 0.436293234 -0.09918934
## md_q1k -0.1455080744 0.008480059 0.065864158 -0.01353261
## md_q1l 0.0208396651 0.267129995 0.045754453 -0.08946704
## md_q1m -0.0864165864 0.112193589 0.036567748 -0.01924319
## md_q2 0.0100129311 0.097308663 0.029697232 0.15526761
## md_q3 0.2567400929 -0.128481931 0.005953032 -0.21375150
## md_total 0.0478185986 -0.028648309 0.009420751 -0.02407997
## alcohol -0.3247676383 -0.104343186 -0.148056062 -0.02649561
## thc -0.1087895821 -0.169723443 -0.205027261 0.08020182
## cocaine 0.1594818917 0.320854953 0.252912706 -0.09741821
## stimulants 0.2426042800 0.244189866 -0.118722054 0.20153971
## sedative_hypnotics 0.2309244363 -0.162910974 0.203122192 0.41381435
## opioids 0.0015252036 0.086552953 -0.121570123 -0.28590673
## court_order 0.1636974608 0.154698357 0.245582510 0.15316215
## education 0.0163118730 0.008955391 0.115138035 -0.04181500
## hx_of_violence 0.0163118730 0.008955391 0.115138035 -0.04181500
## disorderly_conduct -0.0702596164 -0.190800072 0.151453792 0.01470623
## suicide 0.1963560061 0.030606298 -0.110984797 0.15148028
## abuse 0.2053274927 0.033526867 -0.080407467 0.09662056
## non_subst_dx 0.1979817441 -0.226810587 -0.112313461 -0.41547120
## subst_dx 0.0329927791 -0.014852893 0.042269592 -0.32352038
## PC17 PC18 PC19 PC20
## age -0.1870326699 -0.02365295 -0.0117718053 -0.077746561
## sex -0.1462178029 -0.08763934 -0.2634067449 0.022750722
## race 0.0953467402 0.03638251 -0.2031385134 0.029784290
## md_q1a 0.0174326509 -0.01902714 -0.2180371086 0.449362965
## md_q1b -0.3718839962 0.09825459 -0.0278371827 -0.023705035
## md_q1c 0.1194484634 -0.16435465 -0.0789192301 -0.213613049
## md_q1d 0.0619530246 0.36819155 -0.0099590179 0.003661264
## md_q1e 0.2614331687 -0.17702293 0.1789539877 0.003872197
## md_q1f 0.3003612487 0.22281552 0.2723682222 -0.099204693
## md_q1g 0.2109967516 -0.37334403 0.0683324997 -0.152417977
## md_q1h 0.0006187089 0.19273542 -0.0585774650 0.369583110
## md_q1i -0.2992088240 0.02755310 -0.1363472552 -0.043488447
## md_q1j -0.1472169243 -0.28207854 -0.0881950334 -0.246076181
## md_q1k -0.0771208635 0.20794337 0.1232363649 -0.285598549
## md_q1l 0.0057437817 0.02069467 -0.2362904536 -0.010916597
## md_q1m 0.3973305215 -0.02120433 -0.2934970419 0.164172519
## md_q2 -0.1648016440 -0.13969418 0.2080801007 0.018659831
## md_q3 -0.2105938751 0.24904888 0.1271540043 -0.059873173
## md_total -0.0197334612 0.06137333 -0.0006977298 -0.027775182
## alcohol -0.1820015150 -0.07026246 0.0956119986 -0.144798299
## thc -0.0236739359 -0.12382008 0.0165876036 -0.013157247
## cocaine -0.1228473101 0.11275844 0.4526434824 0.316320096
## stimulants -0.1045699017 0.05519618 -0.0528155014 -0.053845352
## sedative_hypnotics -0.0595682203 0.09175329 0.0875320511 0.202051868
## opioids 0.1754332562 -0.01200602 0.0329000806 -0.169082560
## court_order 0.2611983949 0.24423125 -0.0924090299 0.006698803
## education -0.0510309202 0.04415637 -0.0586475426 0.072776513
## hx_of_violence -0.0510309202 0.04415637 -0.0586475426 0.072776513
## disorderly_conduct -0.1783780631 -0.15117536 -0.1927103963 0.040641543
## suicide -0.0382237128 -0.39632642 0.0465325545 0.322696844
## abuse 0.0588274512 0.23558516 -0.3264000560 -0.214625464
## non_subst_dx 0.0973527244 -0.07849497 0.2071535114 0.137387595
## subst_dx 0.1188237993 -0.02903089 0.2173885911 -0.155169110
## PC21 PC22 PC23 PC24
## age -0.170897495 0.193573958 0.0255079652 0.31301604
## sex 0.047421155 0.134956383 0.1010911002 0.08370083
## race 0.159887305 -0.141562557 -0.1879680691 0.48740095
## md_q1a 0.161547706 0.315140028 -0.2192332996 -0.17167110
## md_q1b -0.288197625 -0.101481533 -0.2228550358 -0.16766404
## md_q1c -0.044161287 0.324717761 -0.0880631348 -0.13563961
## md_q1d -0.115488196 -0.250324317 0.4180957121 -0.03936866
## md_q1e -0.064158483 -0.256973739 -0.2646152392 0.01724736
## md_q1f 0.463246339 0.298199187 0.0965312647 0.15877717
## md_q1g 0.091283278 -0.392803372 -0.1218131778 -0.05134410
## md_q1h -0.051777496 -0.121814172 -0.1178974225 -0.02013894
## md_q1i 0.239243829 -0.143772590 0.1948224884 -0.05787820
## md_q1j 0.057730397 0.064083649 0.1811352578 0.01645527
## md_q1k -0.253871846 0.135164775 -0.1083114379 0.19114450
## md_q1l 0.103771257 0.124228992 0.0913237629 0.10945543
## md_q1m -0.356974559 -0.083875233 0.2649960846 0.07664469
## md_q2 -0.171850246 0.031716664 0.0712162935 -0.12558114
## md_q3 0.162531921 0.039002813 -0.1061998218 0.28323765
## md_total 0.005290795 0.019323755 -0.0002904846 0.04265021
## alcohol 0.101046295 -0.130950510 -0.1743377008 -0.05086443
## thc -0.260448159 0.115118494 0.2559443322 0.26543853
## cocaine -0.133665496 0.038646011 -0.0878104464 -0.03711447
## stimulants -0.112060163 -0.005252219 0.0954619503 0.07688555
## sedative_hypnotics 0.129499260 -0.260596252 0.0809655188 0.20148241
## opioids -0.256493743 0.120058085 -0.3469397936 0.24416249
## court_order -0.046960983 0.043873170 -0.1999846191 -0.06160760
## education 0.027584211 0.001538790 -0.0669431107 0.00637302
## hx_of_violence 0.027584211 0.001538790 -0.0669431107 0.00637302
## disorderly_conduct 0.165207538 -0.272893665 -0.1472912389 0.27901502
## suicide 0.016529805 0.212829718 0.0659077659 0.10013141
## abuse -0.014115202 -0.081850623 -0.1216962003 -0.19786384
## non_subst_dx -0.112290928 -0.101798580 0.1509651126 0.23410361
## subst_dx 0.185699385 -0.032378270 0.1891561946 -0.17117134
## PC25 PC26 PC27 PC28
## age -0.26785346 -0.002253169 -0.127366151 -0.309086965
## sex 0.04916629 0.089355875 -0.264718977 -0.073307282
## race -0.08937166 -0.224083967 -0.014000175 0.082232949
## md_q1a -0.34031571 -0.095689106 -0.013450032 0.066429003
## md_q1b 0.18105630 -0.347834216 0.199691499 -0.443849874
## md_q1c 0.07882840 -0.110232027 -0.216744793 -0.082686496
## md_q1d -0.15329029 -0.045832716 0.113508812 0.041260242
## md_q1e -0.01914802 0.344548430 0.343917831 -0.046766762
## md_q1f 0.23394239 -0.318611945 0.047248561 -0.009746954
## md_q1g -0.13657360 -0.035592007 -0.209064624 -0.165224788
## md_q1h 0.05390273 -0.062122306 -0.299977635 -0.128100563
## md_q1i -0.08160362 0.149251088 0.257758807 0.058417783
## md_q1j 0.04203937 -0.275594820 0.180658726 0.134289661
## md_q1k 0.11576300 0.274000729 -0.244072892 -0.006456198
## md_q1l -0.15274544 0.230878519 0.194873631 -0.176675641
## md_q1m 0.30781921 -0.046028033 -0.042117541 0.088223976
## md_q2 -0.21134622 0.056851394 -0.368644429 0.474590528
## md_q3 0.08944600 0.325697793 -0.008096621 0.083313370
## md_total 0.01980200 0.047655620 0.007179400 0.002639981
## alcohol -0.15476069 -0.290611714 0.017444649 0.158125747
## thc -0.26301823 -0.041918997 0.180623314 0.015019790
## cocaine 0.18639578 -0.071499579 0.098500391 0.136635927
## stimulants 0.09543156 -0.178072306 -0.063935200 -0.067730874
## sedative_hypnotics -0.24287796 -0.141420419 -0.110126323 -0.023261608
## opioids -0.13656991 -0.137744089 0.199086321 0.175074811
## court_order -0.23445061 0.070125881 -0.035078251 -0.164432097
## education 0.05753363 0.030861419 0.049668262 -0.048468733
## hx_of_violence 0.05753363 0.030861419 0.049668262 -0.048468733
## disorderly_conduct 0.33326963 0.043939163 -0.202050282 0.062014207
## suicide 0.25613273 0.138254753 0.164969828 -0.044223272
## abuse 0.12189697 -0.042149329 0.004592098 0.299047946
## non_subst_dx -0.01478651 -0.183997902 -0.092480103 -0.071575151
## subst_dx -0.04425541 0.078928034 -0.209057892 -0.363240849
## PC29 PC30 PC31 PC32
## age 0.180608169 0.005244402 -0.191215927 0.0047882298
## sex -0.202131934 0.049319791 0.327907429 -0.0010897764
## race 0.170331437 -0.136731709 0.166632077 0.0046861276
## md_q1a 0.214050074 0.316998687 -0.059936603 -0.0884555682
## md_q1b 0.172297408 -0.061020840 0.118270328 -0.0979929477
## md_q1c -0.285169720 0.012657671 0.127109383 -0.1010277811
## md_q1d -0.158208574 0.106247058 -0.042375739 -0.0917618750
## md_q1e 0.116353935 -0.155536570 0.071895756 -0.0915455394
## md_q1f 0.071828922 0.049059380 0.050287944 -0.0979440165
## md_q1g -0.225746161 0.390188385 -0.141826074 -0.0767674339
## md_q1h -0.163932462 -0.274730610 -0.380673251 -0.0795874460
## md_q1i 0.259893527 0.240654902 0.281301014 -0.1032690575
## md_q1j 0.050071871 -0.178171936 -0.393095996 -0.0931272576
## md_q1k 0.068913345 0.334696850 0.106622557 -0.1026426523
## md_q1l -0.399868384 -0.410440999 0.224667155 -0.1007298829
## md_q1m 0.141926964 0.105292203 0.072805250 -0.0998480201
## md_q2 0.245649787 -0.319154375 0.160377612 -0.0913522530
## md_q3 -0.070802457 -0.011747432 -0.314380498 -0.2046759901
## md_total -0.007658571 0.017569575 -0.032249499 0.9127971557
## alcohol -0.265887230 -0.028656734 0.167401154 0.0019317750
## thc -0.106577691 0.140404296 -0.212588417 -0.0012956326
## cocaine -0.242046781 0.090716536 0.111358104 -0.0082806317
## stimulants 0.010480051 0.027641891 0.094178121 -0.0025172554
## sedative_hypnotics -0.058346759 0.074134262 0.046529907 0.0021420482
## opioids 0.036590202 0.010835635 0.024435561 -0.0037791273
## court_order 0.078890841 -0.120406802 0.090754495 -0.0007958411
## education -0.015385660 0.091091406 -0.023997820 0.0004329570
## hx_of_violence -0.015385660 0.091091406 -0.023997820 0.0004329570
## disorderly_conduct 0.076250065 -0.004390021 -0.006114958 0.0067759746
## suicide 0.044638381 -0.015072008 -0.057500849 0.0034026240
## abuse 0.063769436 -0.080075987 -0.173345676 0.0056517048
## non_subst_dx 0.069950480 -0.146471307 0.227337964 0.0055267665
## subst_dx 0.360430181 -0.176006698 -0.042213313 0.0073415984
## PC33
## age 0.000000e+00
## sex 4.446564e-17
## race -1.000491e-16
## md_q1a 2.588352e-16
## md_q1b 3.507338e-16
## md_q1c 6.302357e-16
## md_q1d 5.178221e-16
## md_q1e 1.907546e-16
## md_q1f 1.233500e-16
## md_q1g 2.085367e-16
## md_q1h 2.778133e-16
## md_q1i 2.859931e-16
## md_q1j 4.103841e-16
## md_q1k 3.666871e-16
## md_q1l 2.954597e-16
## md_q1m 2.173715e-16
## md_q2 3.447506e-16
## md_q3 5.957333e-16
## md_total -3.085676e-15
## alcohol 2.859500e-16
## thc -2.062939e-17
## cocaine -1.448070e-16
## stimulants -5.207871e-17
## sedative_hypnotics 2.756697e-17
## opioids -4.820730e-19
## court_order 1.379461e-16
## education 7.071068e-01
## hx_of_violence -7.071068e-01
## disorderly_conduct -1.034042e-16
## suicide -1.493044e-16
## abuse -1.382459e-16
## non_subst_dx -1.450787e-16
## subst_dx -8.218822e-17
names(pca.out)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
Next, we will print the summary of the prcomp object and draw a biplot.
# summary of the prcomp object
summary(pca.out)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.6499 1.66758 1.51779 1.41044 1.3175 1.23405
## Proportion of Variance 0.2128 0.08427 0.06981 0.06028 0.0526 0.04615
## Cumulative Proportion 0.2128 0.29705 0.36686 0.42714 0.4797 0.52589
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 1.20332 1.10727 1.05350 1.01224 0.98184 0.95678
## Proportion of Variance 0.04388 0.03715 0.03363 0.03105 0.02921 0.02774
## Cumulative Proportion 0.56977 0.60692 0.64056 0.67160 0.70082 0.72856
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 0.91261 0.87905 0.86427 0.83893 0.81415 0.77188
## Proportion of Variance 0.02524 0.02342 0.02264 0.02133 0.02009 0.01805
## Cumulative Proportion 0.75380 0.77721 0.79985 0.82117 0.84126 0.85932
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 0.7511 0.72022 0.69616 0.65429 0.6371 0.63217
## Proportion of Variance 0.0171 0.01572 0.01469 0.01297 0.0123 0.01211
## Cumulative Proportion 0.8764 0.89213 0.90682 0.91979 0.9321 0.94420
## PC25 PC26 PC27 PC28 PC29 PC30
## Standard deviation 0.59015 0.58270 0.52244 0.50912 0.49183 0.46030
## Proportion of Variance 0.01055 0.01029 0.00827 0.00785 0.00733 0.00642
## Cumulative Proportion 0.95475 0.96504 0.97331 0.98117 0.98850 0.99492
## PC31 PC32 PC33
## Standard deviation 0.40766 0.03891 4.072e-16
## Proportion of Variance 0.00504 0.00005 0.000e+00
## Cumulative Proportion 0.99995 1.00000 1.000e+00
Here we get principal components named PC1 through PC33, one per input variable. Each explains a share of the total variation in the dataset. For example, PC1 alone explains about 21% of the total variance, i.e. roughly one-fifth of the information in the dataset is captured by that single principal component; PC2 explains about 8%, and so on. Note that PC33 has an essentially zero standard deviation, which indicates an exact linear dependency among the input columns.
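The "Proportion of Variance" row of summary() is just each squared standard deviation over their total. A minimal sketch on a small synthetic matrix (the same arithmetic applies directly to pca.out above):

```r
# Sketch: recompute the Proportion of Variance row of summary() by hand,
# using toy data; substitute pca.out for p to apply it to the ADHD PCA.
set.seed(1)
toy <- matrix(rnorm(50 * 4), ncol = 4)
p   <- prcomp(toy, scale. = TRUE)
pve <- p$sdev^2 / sum(p$sdev^2)            # proportion of variance per PC
stopifnot(isTRUE(all.equal(sum(pve), 1)))  # proportions sum to 1
round(cumsum(pve), 4)                      # the Cumulative Proportion row
```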
Next, we plot the PCA. By "plotting a PCA" we usually mean a scatterplot of the scores on the first two principal components, PC1 and PC2. Such plots can reveal features of the data such as clustering, non-linearity, and departures from normality. The PC1 and PC2 scores are computed for each observation and plotted.
# loading library
library(ggfortify)
# the autoplot() function from the ggfortify package plots the PCA scores
pca.out.plot <- autoplot(pca.out,
                         data = adhd,
                         colour = 'suicide')
## Warning: `select_()` was deprecated in dplyr 0.7.0.
## Please use `select()` instead.
pca.out.plot
For a better understanding of the linear transformation of the features, the biplot() function can also be used to plot the PCA.
# biplot() draws the plot as a side effect and returns NULL invisibly,
# so there is no need to assign its result
biplot(pca.out)
To judge how many components are worth keeping after performing PCA, the plot() function can be used on the prcomp object to draw a scree plot.
# like biplot(), plot() draws as a side effect; no assignment needed
plot(pca.out, type = "l")
In a scree plot, the "elbow" (arm-bend) marks the point beyond which each additional component contributes noticeably less variance. The plot above shows the bend at the second principal component.
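Another common rule of thumb is to keep the smallest number of components that reach a target share of the variance. A sketch on toy data (for pca.out above, the Cumulative Proportion row of summary() shows the 80% mark is first crossed at PC16):

```r
# Sketch: smallest number of components whose cumulative proportion of
# variance reaches a target; substitute pca.out for p to apply it above.
set.seed(1)
toy <- matrix(rnorm(60 * 5), ncol = 5)
p   <- prcomp(toy, scale. = TRUE)
pve <- p$sdev^2 / sum(p$sdev^2)
k   <- which(cumsum(pve) >= 0.80)[1]  # first PC with cumulative PVE >= 80%
k
```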
Which questions bear most heavily on each component's score? Setting aside the aggregate md_total score (whose PC1 loading, -0.371, is the largest in magnitude), question Q1a carries the heaviest loading on PC1 (md_q1a = -0.270). PC2 is driven most strongly by the cocaine item (cocaine = 0.364), along with question Q3 (md_q3 = -0.251).
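Rather than scanning the rotation matrix by eye, the variables can be ranked by absolute loading. A small sketch with a hypothetical helper, top_loadings(), shown on a toy rotation matrix; applied to the fit above it would be called as top_loadings(pca.out$rotation, "PC1"):

```r
# Sketch: return the n variables with the largest |loading| on one component
top_loadings <- function(rot, pc, n = 3) {
  v <- rot[, pc]                                # named loading vector
  head(v[order(abs(v), decreasing = TRUE)], n)  # keeps variable names
}

# toy rotation matrix for illustration
rot <- matrix(c(0.9, 0.1, -0.2, 0.4), ncol = 2,
              dimnames = list(c("q1", "q2"), c("PC1", "PC2")))
top_loadings(rot, "PC1", 1)   # q1 has the largest |loading| on PC1
```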